Cases: Distance Calculations and Initial Clusters

Initial analysis and clustering of the case-only data

Brian Muchmore https://github.com/bmuchmore (GENYO)http://www.genyo.es/
December 09, 2018

NOTE

Given the same data, the following commands should yield almost identical results to the results being shown with any discrepancies being stochastic in nature. Some of these commands, however, can take a long time to run, so while we show the commands here as we originally ran them, results are often being read back from file. If you are trying to recapitulate these results using the exact data and code being used here and are running into problems or incongruitous results, please submit an issue, and we will address it as soon as possible.

Set-up

All data analysis was done on a desktop with 8 cores/16 threads (AMD Ryzen 7 1800x) and 32 GB of DDR4 memory. We begin by setting a “cores” variable for future use.


cores <- 8

Next is a general purpose naming function that will be used to generate any paths seen below:


name <- function(path = "/home/brian/Desktop/flow", folder = "cases", file, type = "rds") {
  paste0(path, "/", folder, "/", file, ".", type)
}

Package Information

These are the packages I will be using.


library(PreciseDist)
library(readr)
library(future)
library(doFuture)
library(heatmaply)

This is the session info.


sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
 [1] heatmaply_0.15.2       viridis_0.5.1         
 [3] viridisLite_0.3.0      plotly_4.7.1          
 [5] ggplot2_3.0.0          doFuture_0.6.0        
 [7] iterators_1.0.10       foreach_1.4.4         
 [9] future_1.9.0           readr_1.1.1           
[11] PreciseDist_0.0.0.9000

loaded via a namespace (and not attached):
  [1] R.utils_2.6.0           tidyselect_0.2.4       
  [3] htmlwidgets_1.2         TSP_1.1-6              
  [5] trimcluster_0.1-2       grid_3.4.4             
  [7] ranger_0.10.1           Rtsne_0.13             
  [9] munsell_0.5.0           codetools_0.2-15       
 [11] SNFtool_2.3.0           miniUI_0.1.1.1         
 [13] misc3d_0.8-4            withr_2.1.2            
 [15] colorspace_1.3-2        longitudinalData_2.4.1 
 [17] Boruta_6.0.0            knitr_1.20             
 [19] geometry_0.3-6          stats4_3.4.4           
 [21] robustbase_0.93-1       dtw_1.20-1             
 [23] dimRed_0.1.0            DistatisR_1.0          
 [25] listenv_0.7.0           radix_0.5.0.9001       
 [27] DistributionUtils_0.5-1 Ckmeans.1d.dp_4.2.2    
 [29] rprojroot_1.3-2         locpol_0.7-0           
 [31] ipred_0.9-6             randomForest_4.6-14    
 [33] gclus_1.3.1             diptest_0.75-7         
 [35] R6_2.2.2                seriation_1.2-3        
 [37] fields_9.6              rpivotTable_0.3.0      
 [39] flexmix_2.3-14          manipulateWidget_0.10.0
 [41] DRR_0.0.3               bitops_1.0-6           
 [43] assertthat_0.2.0        promises_1.0.1         
 [45] networkD3_0.4           SDMTools_1.1-221       
 [47] scales_1.0.0            nnet_7.3-12            
 [49] mmtsne_0.1.0            gtable_0.2.0           
 [51] ddalpha_1.3.4           globals_0.12.1         
 [53] spam_2.2-0              timeDate_3043.102      
 [55] rlang_0.2.2             CVST_0.2-2             
 [57] RcppRoll_0.3.0          profileModel_0.5-9     
 [59] splines_3.4.4           lazyeval_0.2.1         
 [61] ModelMetrics_1.1.0      princurve_2.1.0        
 [63] checkmate_1.8.5         trelliscopejs_0.1.14   
 [65] broom_0.5.0             heatmap.plus_1.3       
 [67] rgl_0.99.16             yaml_2.2.0             
 [69] reshape2_1.4.3          abind_1.4-5            
 [71] threejs_0.3.1           crosstalk_1.0.0        
 [73] backports_1.1.3         httpuv_1.4.4.2         
 [75] caret_6.0-80            tools_3.4.4            
 [77] lava_1.6.2              infer_0.3.1            
 [79] gplots_3.0.1            RColorBrewer_1.1-2     
 [81] proxy_0.4-22            BiocGenerics_0.24.0    
 [83] analogue_0.17-0         Rcpp_0.12.18           
 [85] splus2R_1.2-2           plyr_1.8.4             
 [87] visNetwork_2.0.4        base64enc_0.1-3        
 [89] progress_1.2.0          purrr_0.2.5            
 [91] prettyunits_1.0.2       rpart_4.1-13           
 [93] diffusr_0.1.4           zoo_1.8-3              
 [95] sfsmisc_1.1-2           cluster_2.0.7-1        
 [97] magrittr_1.5            data.table_1.11.4      
 [99] TSclust_1.2.4           mvtnorm_1.0-8          
[101] whisker_0.3-2           matrixStats_0.53.1     
[103] hms_0.4.2               NetPreProc_1.1         
[105] mime_0.6                evaluate_0.11          
[107] xtable_1.8-3            mclust_5.4.1           
[109] gridExtra_2.3           compiler_3.4.4         
[111] tibble_1.4.2            maps_3.3.0             
[113] mgc_1.0.1               KernSmooth_2.23-15     
[115] crayon_1.3.4            R.oo_1.22.0            
[117] htmltools_0.3.6         mgcv_1.8-23            
[119] later_0.7.3             tidyr_0.8.1            
[121] RcppParallel_4.4.1      lubridate_1.7.4        
[123] magic_1.5-8             fpc_2.1-11             
[125] autocogs_0.0.1          MASS_7.3-49            
[127] Matrix_1.2-14           permute_0.9-4          
[129] gdata_2.18.0            wmtsa_2.0-3            
[131] R.methodsS3_1.7.1       dotCall64_1.0-0        
[133] bindr_0.1.1             gower_0.1.2            
[135] igraph_1.2.1            ifultools_2.0-4        
[137] pkgconfig_2.0.1         registry_0.5           
[139] brglm_0.6.1             ExPosition_2.8.19      
[141] philentropy_0.2.0       microbenchmark_1.4-4   
[143] recipes_0.1.3           clv_0.3-2.1            
[145] webshot_0.5.0           prodlim_2018.04.18     
[147] LPStimeSeries_1.0-5     stringr_1.3.1          
[149] digest_0.6.18           pls_2.6-0              
[151] vegan_2.5-2             graph_1.56.0           
[153] rmarkdown_1.10          dendextend_1.8.0       
[155] uwot_0.0.0.9004         kernlab_0.9-26         
[157] gtools_3.8.1            modeltools_0.2-22      
[159] shiny_1.1.0             nlme_3.1-131           
[161] glasso_1.10             jsonlite_1.5           
[163] bindrcpp_0.2.2          alluvial_0.1-2         
[165] TSdist_3.4              pillar_1.3.0           
[167] lattice_0.20-35         httr_1.3.1             
[169] DEoptimR_1.0-8          survival_2.41-3        
[171] glue_1.3.0              xts_0.11-0             
[173] prabclus_2.2-6          class_7.3-14           
[175] stringi_1.2.3           pdc_1.0.3              
[177] KODAMA_1.5              rsample_0.0.2          
[179] caTools_1.17.1          dplyr_0.7.6            
[181] hglasso_1.2            

Distance Calculations

The first step is to take our flow case data and calculate as many distances, similarities and correlations as possible. Some of these calculations may take seconds while others may take hours, so we generally run this overnight and then kill the function if it is still running in the morning (if the file parameter is set, once a distance finishes it is written to file, so killing the function does not harm already written data).


library(future)
library(doFuture)
options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = cores)
flow_case_dists <- flow_cases %>%
  as.matrix() %>%
  precise_dist(
    dists = "all_dists",
    suffix = "",
    file = name(file = "flow_case_dists"),
    parallel = TRUE,
    local_timeout = Inf,
    verbose = TRUE
  )

Now we coerce all non-distances (i.e. similarities and correlations) into distances.


flow_case_distances <- flow_case_dists %>%
  precise_transform(enforce_dist = TRUE)

UMAP Results for Every Distance

Now that we have a large list of distances, we can feed each distance as input into the UMAP algorithm.


options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = 4)
flow_case_umap <- precise_umap(
  data = flow_case_distances,
  distance = TRUE,
  n_neighbors = 5,
  spread = 10,
  min_dist = 0.1,
  bandwidth = 1,
  type = "plotly",
  color_vec = NULL,
  colors = NULL,
  parallel = TRUE,
  verbose = TRUE
)

Once we have a list of UMAP results, we will feed it to precise_trellis(), which will automatically take our list of UMAP results and output a visualization for each result. The point of this is to look for UMAP results that indicate the calculated distance is capturing some amount of intrinsic structure within the data.

NOTE: This is an interactive visualization. You can also maximized and minimize it by clicking on the juxtaposed arrows in the bottom right hand corner of the vizualization.


precise_trellis(
  data = flow_case_umap, 
  name = "UMAP of Every Flow Distance",
  path = name(file = "trellis_flow_case_umap"), 
  self_contained = TRUE
)

It can be good to view the data without any coloring to get a sense for the structure potentially present and nothing else. In this case, however, we know that PBMCs from the flow cytometry data should correlate with any structure we see. That is, if the UMAP visualization shows patients with a high or low number of PBMCs scattered about instead of adjacent to one another, it is doubtful we can trust that distance to have acurately captured the patterns in the data. So, we run the same UMAP analysis, but we will color each patient with their repective number of PBMCs.


options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = 4)
flow_case_pbmc_umap <- precise_umap(
  data = flow_case_distances,
  distance = TRUE,
  n_neighbors = 5,
  spread = 10,
  min_dist = 0.1,
  bandwidth = 1,
  type = "plotly",
  color_vec = flow_cases[["pbmc"]],
  colors = "default",
  parallel = TRUE,
  verbose = TRUE
)

NOTE: If the vizualization is not appearing, simply scroll to the next image, and it should now appear.


precise_trellis(
  data = flow_case_pbmc_umap, 
  name = "UMAP of Every Flow Distance Colored by PBMCs",
  path = name(file = "trellis_flow_case_pbmc_umap"), 
  self_contained = TRUE
)

By scrolling through the various UMAP results, we see that a number of the results look similar, but it is not clear if different distances really are similar or if it just looks that way. To make the results more clear, we will call precise_correlations() to create a correlation matrix of distances.


flow_case_distance_correlations <- precise_correlations(
  data = flow_case_distances,
  method = "pearson",
  permutations = 101,
  parallel = FALSE,
  verbose = TRUE
)

Now that we have the correlations, we can visualize the correlation matrix using heatmaply_cor().


heatmaply_cor(flow_case_distance_correlations$statistic)

At this point, we now have a better understanding of how each distance has captured the signal in the data, and how the distances are related to each other. The goal now is to see if we can improve the signal in a single distance or a group of distances. We begin by choosing one promising distance that showed little correlation to other distances and some clear structure when visualized with UMAP. The distance we choose here is soergel, and to increase it’s signal we will set the partitions = 10, which will run the distance 10 times and add 10% noise to each run.


options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = 5)
flow_case_soergel_dists <- flow_cases %>%
  as.matrix() %>%
  precise_dist(
    dists = "soergel",
    suffix = "",
    partitions = 10,
    file = name(file = "flow_case_soergel_dists"),
    parallel = TRUE,
    local_timeout = Inf,
    verbose = TRUE
  )

Now we will transform each of the distances and then fuse them into a single distance. This step is largely trial-and-error to see which transformations increase the signal and structure present in each distance, although the transformations shown here are often well suited for a variety of problems. In addition, being the simplest fusion algorithm, fusion = fuse is generally the first fusion option to try.


flow_case_soergel_fusion <- flow_case_soergel_dists %>%
  precise_transform(return_list = TRUE) %>%
  precise_transform(enforce_sim = TRUE) %>%
  precise_transform(fixed_k = 100) %>%
  precise_transform(transform = "laplacian") %>%
  precise_transform(enforce_dist = TRUE) %>%
  precise_fusion(fusion = "fuse")

Now that we have a single fused soergel distance, we will build a graph.


library(future)
library(doFuture)
options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = 8)
flow_case_soergel_graph <- precise_graph(
  data = flow_case_soergel_fusion,
  method = 5,
  distance = FALSE,
  n_neighbors = 75,
  spread = 1,
  min_dist = 0.0,
  bandwidth = 1,
  parallel = TRUE,
  verbose = TRUE
)

And then visualize it.


flow_case_soergel_2d_plot <- precise_viz(
  data = flow_case_soergel_graph,
  plot_type = "drl_2d_plot",
  k = 50,
  jitter = 2.5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html = NULL,
  verbose = TRUE
)

Clearly, we have now created a graph with obvious structure. Because there are now clearly 6 clusters, we can use the pamk() function from the fpc package to automatically detect and cluster the 6 clusters we see using the coordinates of the plot as the input data. We do this rather than specifying k = 6 manually in order to show how we can automate the process of clustering when we have such clear results, which will be useful later.


library(fpc)
library(purrr)
library(tibble)
flow_case_soergel_clusters <- flow_case_soergel_2d_plot$plot_layout %>%
  pamk(
    krange = 3:15,
    criterion = "asw", 
    usepam = TRUE,
    scaling = FALSE, 
    alpha = 0.001, 
    diss = FALSE,
    critout = FALSE, 
    ns = 10, 
    seed = NULL
  ) %>%
  .[[1]] %>%
  .[["clustering"]] %>%
  as.character() %>%
  map_chr(~paste0("cluster_", .x)) %>%
  as_tibble() %>%
  select(pam_clusters = value)

The final step of this initial analysis is to see if the clusters mean anything. Because the clusters are supposed to be a reflection of the flow cytometry data, by mapping each flow cytometry feature onto the clusters we should be able to get a sense of whether our clusters are a valid reflection of the data or if they are just noise.


flow_case_soergel_descriptors <- precise_descriptors(
  flow_case_soergel_2d_plot,
  descriptors = cbind(flow_case_soergel_clusters, flow_cases),
  verbose = TRUE,
  rank = TRUE,
  size = 0.5
)

If one scrolls through the visualization, it will become quickly apparent that we have capture legitimate signal in the data.


precise_trellis(
  data = flow_case_soergel_descriptors, 
  name = "Flow Columns and Cluster Results Mapped to Visual Results of Fused Soergel Distances",
  path = name(file = "trellis_flow_case_soergel_descriptors"), 
  self_contained = TRUE
)

How legitimate is the signal though? We can get a sense for the answer to this question by comparing our soergel clusters to clusterings of each flow cytometry feature. The logic is the following: An individual cluster and a clustering as a whole should be representative of the features. If we take PBMCs as an example, we would expect a single cluster to be fairly homogenous with the PBMC values contained within it. In addition, while we would expect the PBMC values to be similar in one cluster, we would expect the average value to be different than another cluster. One way we can measure this is to cluster each flow cytometry feature using an optimcal K-means algorithm and set k to the same value as the number of soergel clusters. Thus, because we have 6 soergel clusters, we will partition every feature into 6 clusters. We can then use the variation of information distance to see how well the feature clusters overlap with each other and overlap with our soergel clusters.


flow_soergel_vi_distance <- flow_cases %>%
  cbind(flow_case_soergel_clusters) %>%
  precise_clust_cors(
    method = "vi", 
    bin = "clusters"
  )

heatmaply(flow_soergel_vi_distance$statistic)

Our initial analysis is done, and it is clear we have captured signal in the data. Many questions still remain though, such as which features are driving the clusters, are the clusters correlated with any of the clinical data, and is this the best clustering we can achieve or can we improve upon it? It is this last question we will focus on next.